project discription

Gene expression is the process through which cells generate diversity and resilience to environmental changes. It consists of two steps: transcription, where an intermediate molecule (mRNA) is produced from the DNA, and translation of mRNA into protein. The rate of transcription and translation can be regulated by regulatory elements acting in both cis and trans. 3’UTR, 5’UTR, and promoter are known as powerful regulatory processes that determine the rate of mRNA synthesis and decay.

In this project, we are interested in using these cis-regulatory features to predict multidimensional output that indicates how the mRNA abundance change in multiple environmental conditions. Since our goal is to quantify the effect of different elements, we should avoid clustering and go straight from element to expression pattern. Linear models with multidimensional lasso (known as an algorithm of multi-task learning) which is already implemented in the glmnet package should be a good start. As an extension, we may apply to multiple states of mRNA processing like splicing, decay, and translation by combining multiple data sets. Finally, we can carry out a quantitative analysis of the contribution of motifs in different types of CRE and ask which regions make more of a contribution to dynamic changes in gene expression.

Index

Part 1: setup and investigate the dataset

After loading all the required library and function, we investigate and evaluate the dataset we have.

Load library and function

library(data.table)
## Warning: package 'data.table' was built under R version 3.6.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
library(lmodel2)
library(splitstackshape)
library(gridExtra)
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
library(impute)

Construct the design matrix

Expression level change of 6152 distinct genes under 173 different environmental stresses from Gasch’s study

In this project, we imported the data describes gene expression level in various environment condition from Gasch’s study where DNA microarrays were applied to measure the relative abundance of mRNA (relative transcript level) before and after a series of environmental stresses. The dataset consists of 6152 rows and 176 columns where each row represents a gene of yeast. The first 3 columns provide the details of the gene and the other 173 columns describe the normalized, background-corrected log2 values expression level change under 173 different environmental stress.

The 69 candidate sequence motifs we use in this project is from 3 different studies:

53 candidate sequence motifs came from the study of Shalgi et al (2005), where motifs are derived by analyzing the exhaustively enumerating all k-mers(k = {8,9,10,11,12}) and looking for over-represented motifs with extreme half-life value. 515 significant k-mers were selected with ranksum test and clustered by ClustalW which resulted in 51 clusters of motifs. Additionally, two more motifs were found by grouping genes with extreme half-life and ran Gibbs sampler.

14 candidate sequence motifs came from the study of Hogan et al. (2008). Two related computational methods were applied to identify candidate binding sites of 40 out of the more than 500 known and predicted RBPs in S. cerevisiae: (1)“finding informative regulatory elements” (FIRE) and (2)“relative filtering by nucleotide enrichment” (REFINE). The former searches for motifs with informative patterns of enrichment. The other one identifies all hexamers that are significantly enriched in untranslated regions, filters out regions of target sequences that are relatively devoid of such hexamers, and then applies the “multiple expectation maximization for motif elicitation” (MEME) motif-finding algorithm. As a result, 14 motifs that are likely to be the binding sites of 16 RBP are found.

The rest 4 were from the study of Cheng (2017): four mRNA-stability related motifs in the 3′ UTR were found by De novo motif searching and their reliability and effect have been examined by linear mixed effect model, Fisher test P-value corrected with Benjamini–Hochberg, Wilcoxon ranksum test and multivariate linear regression.

UTR sequence of 4388 genes from Cheng et al. 

The majority of 3’UTR data came from the research of Cheng et al. Annotated transcript sequences data of 4284 genes out of 6152 genes were downloaded from the GitHub sites introduced in their paper.

1868 gene expression level data loss after inner join above data by gene name.

After inner joining the expression level of 6152 genes with the 4388 motif frequency data which computed from 3’UTR data of Cheng’s study, expression level data of 1868 genes loss.

## [1] "number of missing 3'UTR sequence 1868"

apply fix-length downstream sequence from fungiDB as 3’UTR for 1320 of the 1868 genes.

First, we checked the mean length of 3’UTR from Cheng et al.

## [1] "median length of 3'UTR from Cheng et al. =  119"
## [1] "mean length of 3'UTR from Cheng et al. =  144.873290793072"

Since there is no reliable 3’UTR annotation in most of the online database, we applied downstream sequence with fix-length (120 nucleotides) as 3’UTR. The length of 120 nucleotides is considered as a sensible number close to the median length.

# load UTR sequence from fungiDB
UTR_fungiDB = read_csv("../data/3UTR_fungiDB")
## Parsed with column specification:
## cols(
##   genename = col_character(),
##   UTR3_seq = col_character()
## )
UTR_Cheng_fungiDB = rbind(UTR_cheng,UTR_fungiDB,fill=TRUE)

names_gene_cheng = UTR_cheng$genename
names_gene_fungiDB = UTR_fungiDB$genename

Construct the design matrix with additional data from fungiDB

# get UTR sequence from both dataset

ref_motifs <- tibble(geneName = UTR_Cheng_fungiDB$genename)
for (i in 1:length(motifs)){
  ref_motifs <- mutate(.data = ref_motifs,!!motifs_raw[i] := str_count(UTR_Cheng_fungiDB$UTR3_seq, motifs[i]))
}


# merge motifs
fullTable_Gasch <- merge(expressionLevel_Gasch,ref_motifs,by = "geneName")

check the motif frequency from both dataset “cheng” and “fungiDB”

fullTable_Gasch_original = fullTable_Gasch[fullTable_Gasch$geneName %in% names_gene_cheng,]
fullTable_Gasch_fungiDB = fullTable_Gasch[fullTable_Gasch$geneName %in% names_gene_fungiDB,]


summary_motif_frequency = data.frame(matrix(ncol = 3, nrow = 0))

for(m in names_motifs_all){
  summary_motif_frequency = rbind(summary_motif_frequency, data.frame(
    motif = m,
    dataset = "Cheng",
    frequency = sum(fullTable_Gasch_original[[m]]),
    frequency_per_gene = sum(fullTable_Gasch_original[[m]])/(dim(fullTable_Gasch_original)[1])
  ))
  summary_motif_frequency = rbind(summary_motif_frequency, data.frame(
    motif = m,
    dataset = "fungiDB",
    frequency = sum(fullTable_Gasch_fungiDB[[m]]),
    frequency_per_gene = sum(fullTable_Gasch_fungiDB[[m]])/(dim(fullTable_Gasch_fungiDB)[1])
  ))
}

compare motif frequency of 120 downstream and actual 3’UTR

Construct Motif Frequencies Matrix from 3’UTR Ref sequences
ref_motifs = ref_motifs[ref_motifs$geneName %in% ref_motifs_120downstream$geneName,]
plot the frequency per gene

Construct the design matrix with additional UTR sequence data from Samuel

additional 503 downstream sequences from Samuel’s lab dataset are added to construct the design matrix

UTR_Samuel <- read.csv(file="../data/genome_wide_120nt_3UTR_sequence.csv")
colnames(UTR_Samuel) = c("genename","UTR3_seq")

UTR_Cheng_fungiDB_Samuel = rbind(UTR_Cheng_fungiDB,UTR_Samuel,fill=TRUE)
# get UTR sequence from both dataset
UTR_3 <- UTR_Cheng_fungiDB_Samuel$UTR3_seq

# Search and add frequency of each c(motif) as a column in ref dataset
ref_motifs <- tibble(geneName = UTR_Cheng_fungiDB_Samuel$genename)
for (i in 1:length(motifs)){
  ref_motifs <- mutate(.data = ref_motifs,!!motifs_raw[i] := str_count(UTR_3, motifs[i]))
}


# merge motifs
fullTable_Gasch <- merge(expressionLevel_Gasch,ref_motifs,by = "geneName")

8 motifs are removed from list considering their low frequency on 3’UTR

We matching the 69 motifs on every gene sequence to compute the frequency matrix which could be used as the design matrix of our models. The distribution of the frequency of each motif across all the genes is very unbalanced. 8 motifs have never been investigated from any of the 3’UTR from the study of Cheng et al. Therefore, we temporarily remove them.

Frequency number of motifs (n)
n = 0 7
0 < n < 6 8
5 < n < 21 7
n > 20 47

One possible explanation of the huge amount of low-frequency motifs is that it is hard to match exactly the regular expression to the UTR sequence. Another point is, the secondary-structure motifs are more likely to be found on 5’UTR rather than 3’UTR. Also, a motif with a significant effect on regulation is reasonable to have a small frequency.

Part 2: Assess the rationality of group lasso

Before applying multi-task learning, we would like to investigate the data and assess if it is reasonable to apply group lasso on them.

4 significant motifs on heat-shock-relevant environmental stress.

To investigate how the expression level changes linearly depends on the motifs, we calculate the Pearson correlation coefficient (PCC) between them. Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y: \[ p_{x,y}= {cov(X,Y)\over \sigma_X \sigma_Y} \] We start by picking a group of environmental conditions about Heat Shock from Various Temperatures to 37°C. Four motifs show a much higher correlation coefficient rather than the others, which, indicate the linear relationship between these motifs and expression level. One point worth mentioning is that all these motifs are also focused by Abhi where TGTATAWT is explained to be positively associated with RNA stability and the rest three of them are explained to be negatively associated with the RNA stability. However, Abhi is not quite confident with the conclusion of UGUAHMNUA as the negative relationship can only be derived from Chan’s data but not Sun’s data.

names_temperature_condition = c(
  "hs_17to37_20min",
  "hs_21to37_20min",
  "hs_25to37_20min",
  "hs_29to37_20min",
  "hs_33to37_20min"
)

9 heat-shock-relevant motifs are selected from linear models

Before fitting the linear model, we temporarily remove the motifs with a total frequency lower than 5 since we are not confident enough to tell whether a low-frequency motif depends linearly on the heat-shock related environment condition. Then, we fitted the linear model with the lm() function with default least squares method: \[ \min_w \sum (w*e)^2 \]

## [1] "mean square error: 1.9738011943521"

9 heat-shock-relevant motifs are selected from the linear model with the following coefficient:

Motif Estimate Std.Error t value Pr(>|t|)
UGUAHMNUA -1.054 0.085 -12.369 <2e-16
ATATTC 0.309 0.057 5.432 1.54e-14
TGTATAWT -0.468 0.095 -4.907 5.07e-10
WUUGUAWUWU -0.534 0.157 -3.413 0.000207
UAAUAAUW -0.250 0.083 -3.025 0.106042
TGTAAATA 0.244 0.106 2.308 0.000349
AAAATAAAG -0.568 0.281 -2.018 0.022646
UUUAAUGA 0.302 0.162 1.864 0.000394
TCATGTAT -0.327 0.207 -1.580 0.1141

This table is sorted by Pr(>|t|) (also known as the p-value) which is the probability of achieving a |t| as large as or larger than the observed absolute t value if the null hypothesis (estimate = 0) was true. In short, it is the probability that there is no relationship between that feature and the predicted value (Ronald L. Wasserstein et al., 2016). Estimate shows the coefficient or weight gains by the responding features in this linear model. Std.Error is the standard error of the estimate value, which, can be used to calculate the Confidence interval. For example, the 95% confidence interval could be obtained from Estimate +- 1.96*Std.Error. t value is calculated from the estimates divided by their standard errors. To make the best use of this value, we need to look up the table of t distribution to learn the reject boundary. In this case, we could simply say the larger the magnitude of the t-value is, the less likely that the coefficient is 0.

Identify potential relationships between tasks

To penalty the irrelevant features and ensure the comparison across the models, we plan to apply group lasso. However, before applying the multi-task learning method, we want to ensure there are potential relationships between tasks.

names_motifs_temperature_significant = c('UGUAHMNUA','TGTATAWT','ATATTC','TGTAAATA')

names_temperature_condition = c(
  "hs_15min_hs-1",
  "hs_30min_hs-1",
  "hs_40min_hs-1",
  "hs_60min_hs-1",
  "hs_80min_hs-1"
)

# pick a group of temperature condition
names_temperature_condition = c(
  "29C(1M_sorbitol)~33C(1M_sorbitol)_05min",
  "29C(1M_sorbitol)~33C(1M_sorbitol)_15min",
  "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"
)

names_temperature_condition = c(
  "hs_37to25_15min",
  "hs_37to25_30min",
  "hs_37to25_45min",
  "hs_37to25_60min",
  "hs_37to25_90min"
)

Part 3: apply group lasso

introduction

To investigate and quantify how the cis-regulatory element affects the gene regulation, we built a series of linear models where each model takes the frequencies of cis-regulatory elements as input and predicts how the gene expression level under certain environmental stress. Their coefficient can be used to estimate the regulatory effect of both a single cis-regulatory element or all the cis-regulatory as a group. For example, if an element is given a relatively large coefficient in a model which predicts the change of expression level under certain environmental stress, we can say the element play a relatively important role in the regulation under that environmental stress. On the other hand, the bias of the model can be considered as the expected gene expression level change of a gene with no cis-regulatory element we considered in the model. In other words, the bias indicates the expression level change that cannot be explained by the model and can be used to estimate the proportion of the efforts on regulation these cis-regulatory elements make.

At this stage, our aim is to apply group lasso as a multi-task learning method to training all the models together which allows the comparison across each model and ensures their generalization. The L2 penalty term can filter out most of the irrelevant factors by penalizing their coefficient towards zero.

The penalty on the coefficient vector for variable j is: \[ (1-\alpha)/2||\beta_j||_2^2+\lambda||\beta_j||_2. \]

distribution of missing value

Above all the expression level data, around 3% of them are missing value, Where most of the genes (94.8%) have a relatively complete expression levels’ data (90%) while 41 genes (0.7%) have missing valves under more than quarter of environmental stresses.

number of genes missing value (n)
755 (12.3%) 0
4372 (71.1%) 0 < n < 9 (5%)
701 (11.4%) 8 < n < 18 (10%)
283 (4.6%) 17 < n < 44 (25%)
41 (0.7%) 43 < n < 99 (58%)

The 41 genes with missing values larger than 43 and smaller than 99

names_genes = expressionLevel_Gasch$geneName
names_genes[na_count_sum > 43 & na_count_sum <99]
##  [1] "YAL058C-A" "YAR066W"   "YBR003W"   "YBR151W"   "YBR153W"   "YBR155W"  
##  [7] "YBR157C"   "YBR159W"   "YDL036C"   "YDL130W"   "YDL206W"   "YDL207W"  
## [13] "YDL208W"   "YDL209C"   "YDL210W"   "YDL211C"   "YDL212W"   "YDL213C"  
## [19] "YDL214C"   "YDL216C"   "YDL217C"   "YDR430C"   "YDR434W"   "YEL076W-C"
## [25] "YER066C-A" "YFL013W-A" "YHR149C"   "YIL015C-A" "YIL066C"   "YIL067C"  
## [31] "YIL069C"   "YIL070C"   "YIL072W"   "YIL075C"   "YIL076W"   "YJL012C"  
## [37] "YJL150W"   "YJL194W"   "YML084W"   "YMR117C"   "YOR304C-A"

On the other hand, the distribution of missing value by 173 environmental stress indicates that there are 14 environmental stress that have a relatively higher missing value ratio (larger than 10% but smaller than 23%)

number of environmental stresses missing value (n)
0 0
55 (31.8%) 0 < n < 62 (1%)
97 (56.1%) 61 < n < 308 (5%)
7 (4.0%) 307 < n < 616 (10%)
14 (8.1%) 615 < n < 1385 (23%)

the 14 environmental stresses with missing value ratio larger than 10% are listed here

na_count_sum[na_count_sum>315]
##                           hs_05min_hs-1                           hs_10min_hs-1 
##                                     893                                     704 
##                           hs_15min_hs-1                           hs_20min_hs-1 
##                                     500                                     780 
##                           hs_30min_hs-1                           hs_40min_hs-1 
##                                     783                                    1305 
##                           hs_60min_hs-1                           hs_80min_hs-1 
##                                     726                                    1384 
##                0.32mM_H2O2_40min_rescan                  2.5mM_DTT_005min_dtt-1 
##                                     509                                    1002 
##                  2.5mM_DTT_015min_dtt-1                  2.5mM_DTT_030min_dtt-1 
##                                    1162                                    1313 
##                  2.5mM_DTT_045min_dtt-1                  2.5mM_DTT_060min_dtt-1 
##                                     597                                     938 
##                  2.5mM_DTT_090min_dtt-1                  2.5mM_DTT_180min_dtt-1 
##                                     591                                     578 
##           YPD_stationary_phase_1d_ypd-1          YPD_stationary_phase_28d_ypd-1 
##                                     723                                    1043 
## steady state 36 dec C ct-2 (repeat hyb) 
##                                     697

filter out gene with high value-missing ratio (>10%)

The gene with high missing value could be caused by low abundance of mRNA in the cell which means there could be large errors for the measurements on these gene. It is better to remove them first.

na_count_sum = rowSums(is.na(expressionLevel_Gasch[4:176]))
expressionLevel_Gasch_filtered = expressionLevel_Gasch[na_count_sum<18,]

fill missing data by imputation

Applying imputation to estimate the missing values should be a much better way rather than simply ignore the missing values or replacing them by mean or 0 since it reduce the bias. Olga Troyanskaya et al.(2001) suggested that weighted K-nearest neighbors (KNNimpute) could be another sensitive method to deal with the missing value.

According to their article, when we consider gene A that has one missing value under environmental stress X, we could look for how the expression level of gene A change in other environmental stress Y and looking for k other genes B that show similar behavior as gene A. Then we fill the missing value by the weighted average of genes B. The contribution of each gene is weighted by similarity of its expression to that of gene A.

we can estimate the performance of imputation by randomly knocking out 0.1% of non-NA and check the mean square error between estimation and actual value.

an optimal value for k should sit between 5 and 20. In this case, we pick k = 11 for our imputation as a conserved and high-performance hyperparameter.

baseline performance: simply replace missing value with mean of column or 0

NA to column mean

As one of the easiest approach to deal with the missing values, we could simply replace them by mean of that column To estimate of the performance, we randomly knock out 0.1% of non-NA and check the mean square error.

## [1] "estimate mse =  0.660283266661971"
## [1] "estimate mse =  0.583398660384386"
## [1] "estimate mse =  0.6700609834903"
## [1] "estimate mse =  0.577084282879705"
## [1] "estimate mse =  0.586884694563133"
## [1] "estimate mse =  0.614764595154343"
## [1] "estimate mse =  0.668870637824216"
## [1] "estimate mse =  0.649682225303615"
## [1] "estimate mse =  0.747276988365045"
## [1] "estimate mse =  0.710807313618546"
NA to zero
## [1] "estimate mse =  0.674323564435565"
## [1] "estimate mse =  0.582346131868133"
## [1] "estimate mse =  0.680688343656345"
## [1] "estimate mse =  0.584702489510491"
## [1] "estimate mse =  0.586469764235765"
## [1] "estimate mse =  0.626297002997003"
## [1] "estimate mse =  0.685989976023975"
## [1] "estimate mse =  0.6492045004995"
## [1] "estimate mse =  0.757429727272728"
## [1] "estimate mse =  0.712692274725275"
compare the between performance of kNN imputation, NA to column mean and NA to zero
ggplot() +
  geom_point(data = baseline_df, aes("NA to column mean", mse)) +
  geom_point(data = baseline2_df, aes("NA to zero", mse)) +
  geom_point(data = k_value_validation_df[k_value_validation_df$k == 11,], aes("kNN imputation (k=11)",mse)) +
  geom_point(data = baseline_ds, aes("NA to column mean", mean), colour = 'red', size = 3) +
  geom_point(data = baseline2_ds, aes("NA to zero", mean), colour = 'red', size = 3) +
  geom_point(data = k_value_validation_ds[k_value_validation_ds$k == 11,], aes("kNN imputation (k=11)", mean), colour = 'red', size = 3) +
  geom_errorbar(
    data = baseline_ds,
    aes("NA to column mean", mean, ymin = mean - sd, ymax = mean + sd),
    colour = 'red',
    width = 0.2
  ) + 
  geom_errorbar(
    data = baseline2_ds,
    aes("NA to zero", mean, ymin = mean - sd, ymax = mean + sd),
    colour = 'red',
    width = 0.2
  ) +
  geom_errorbar(
    data = k_value_validation_ds[k_value_validation_ds$k == 11,],
    aes("kNN imputation (k=11)", mean, ymin = mean - sd, ymax = mean + sd),
    colour = 'red',
    width = 0.2
  ) +
  theme(axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold")) +
  labs(x = "")+
  ylim(0,0.8)

consider only motifs with non-zero frequency

motifs_count_sum <- colSums(fullTable_Gasch[names_motifs_all],na.rm=TRUE)
remove_list = NULL

for (i in names(motifs_count_sum)){
  if (motifs_count_sum[[i]] == 0){
    remove_list <- c(remove_list,i)
  }
}

names_motifs_valid <- names_motifs_all[!names_motifs_all %in% remove_list]

choose lambda from 10-fold cross validation

We need to set the hyperparameter lambda to control the effect of penalization. A general approch to this is by cross validation.

cv_models_all = cv.glmnet(x, y, family = "mgaussian")
plot(cv_models_all)

lambda.min is the value of lambda that gives minimum cvm; lambda.1se is the largest value of lambda such that error is within 1 standard error of the minimum; The confidence intervals represent error estimates for the loss metric (red dots). They’re computed using cross validation. The vertical dashed lines show the locations of λmin and λ1se. The numbers across the top are the number of nonzero coefficient estimates. The error is accumulated, and the average error and standard deviation over the folds is computed.

training model with selected lambda

training model with λ = lambda.min results in a model with highest expected performancec while training model with λ = lambda.1se results in a more conservative model with fewer non-zero regression coefficients.

# setting lambda manually when eval_crossValidation = False
lambda.min = 0.0334379510944218
lambda.1se = 0.235898134217399

models_all_minLambda = glmnet(x, y, family = "mgaussian", lambda = lambda.min)
models_all_1seLambda = glmnet(x, y, family = "mgaussian", lambda = lambda.1se)
models_all_minLambda = glmnet(x, y, family = "mgaussian", lambda = cv_models_all$lambda.min)
models_all_1seLambda = glmnet(x, y, family = "mgaussian", lambda = cv_models_all$lambda.1se)

paste("lambda.min =",cv_models_all$lambda.min)
paste("lambda.1se =",cv_models_all$lambda.1se)

the penalize effect with λ = lambda.1se is much stronger than that with λ = lambda.min

We summary coefficients from all the models together to compare the penalized effect of models with different λ values. The model with λ = lambda.1se penalizes about 80% of coefficients to 0 while the model with λ = lambda.min keep more than half of its coefficients non-zero.

regulatory effect of motifs peaks at different times and decrease as time pass

By extracting the regression coefficients given to motifs and sorting them with the previous grouping, we found that the regression coefficients of most motifs show a tendency to rise first, peak in a certain period and then fall. It describes the process of the regulation effect which gradually increases and returns to a stable state after the environmental stress. The time when the regulation effect reaches its peak varies with the type of environmental stress. For example, undereat shock-related environmental stress, the regression coefficients of most motifs peaked in about 15 minutes. while under diamide treatment, the regression coefficient of motifs generally reaches its peak after 30 minutes. It is worth mentioning that the degree of environmental stress has no obvious effect on the peaking period. As shown in Figure 5-1 and Figure 5-3, for most of the motifs, both their regression coefficients in heat shock from 25°C to 37° C and heat shock from 29°C to 33° Cpeaked at around 15 minutes. On the other hand, even under the same environmental conditions, not all the regression coefficients peak at the same period. This may because of the complex regulatory mechanism within the cell. While the cell adjusts its gene expression level, the number of regulatory elements (as products of the gene expression) such as RNA binding proteins that participate in regulation also changes, and lead to a change in the regulatory effect of motifs change as well.

names_temperature_condition = c(
  "hs_05min_hs-1",
  "hs_10min_hs-1",
  "hs_15min_hs-1",
  "hs_30min_hs-1",
  "hs_40min_hs-1",
  "hs_60min_hs-1",
  "hs_80min_hs-1"
)

plot_models_line(models_all_minLambda, names_temperature_condition, 10)

names_temperature_condition = c(
  "hs_29to33_05min",
  "hs_29to33_15min",
  "hs_29to33_30min"
)

plot_models_line(models_all_minLambda, names_temperature_condition, 10)

names_diamide_condition = c(
  "1.5mM_diamide_5min",
  "1.5mM_diamide_10min",
  "1.5mM_diamide_20min",
  "1.5mM_diamide_30min",
  "1.5mM_diamide_40min",
  "1.5mM_diamide_50min",
  "1.5mM_diamide_60min",
  "1.5mM_diamide_90min"
)

plot_models_line(models_all_minLambda, names_diamide_condition, 10)

select the period when gene expression effect is biggest with variance of log2 expression level

type of environmental stress period when gene expression effect is biggest (min)
heatshock 15 ~ 20
hypothermia 15, 60
sorbitol treatment 30 ~ 45
nitrogen depletion 2880 ~ 7200
diamide treatment 30 ~ 40
dithiothrietol exposure 20 ~ 60
menadione exposure 30 ~ 40, 80~120
hydrogen peroxide treatment 20 ~ 30
expressionLevel_matrix = as.matrix(expressionLevel_impute_knn[4:length(expressionLevel_Gasch)])
storage.mode(expressionLevel_matrix) <- "double"
expressionLevel_column_variance = colMeans(expressionLevel_matrix*expressionLevel_matrix)-colMeans(expressionLevel_matrix)*colMeans(expressionLevel_matrix)
expressionLevel_column_variance
##                           hs_05min_hs-1                           hs_10min_hs-1 
##                              0.59943471                              1.38397443 
##                           hs_15min_hs-1                           hs_20min_hs-1 
##                              1.85735216                              1.57825783 
##                           hs_30min_hs-1                           hs_40min_hs-1 
##                              1.33606173                              0.70025092 
##                           hs_60min_hs-1                           hs_80min_hs-1 
##                              0.29295634                              0.28982911 
##                           hs_00min_hs-2                         hs_00min_hs-2_1 
##                              0.39666954                              0.55651148 
##                         hs_00min_hs-2_2                           hs_05min_hs-2 
##                              0.51733475                              0.21095586 
##                           hs_15min_hs-2                           hs_30min_hs-2 
##                              0.56893119                              0.31925497 
##                           hs_60min_hs-2                         hs_37to25_15min 
##                              0.22050697                              0.76376153 
##                         hs_37to25_30min                         hs_37to25_45min 
##                              0.54930628                              0.51776686 
##                         hs_37to25_60min                         hs_37to25_90min 
##                              0.65374148                              0.36877961 
##                         hs_17to37_20min                         hs_21to37_20min 
##                              2.08758442                              1.80558232 
##                         hs_25to37_20min                         hs_29to37_20min 
##                              1.80510726                              1.22693871 
##                         hs_33to37_20min                         hs_29to33_05min 
##                              0.66359174                              0.65668803 
##                         hs_29to33_15min                         hs_29to33_30min 
##                              0.40350369                              0.36452591 
##                        33C_vs_30C_90min 29C(1M_sorbitol)~33C(1M_sorbitol)_05min 
##                              0.15461016                              0.31408115 
## 29C(1M_sorbitol)~33C(1M_sorbitol)_15min 29C(1M_sorbitol)~33C(1M_sorbitol)_30min 
##                              0.43273282                              0.21103672 
## 29C(1M_sorbitol)~33C(NO_sorbitol)_05min 29C(1M_sorbitol)~33C(NO_sorbitol)_15min 
##                              0.34761384                              0.26696581 
## 29C(1M_sorbitol)~33C(NO_sorbitol)_30min                  0.32mM_H2O2_10min_redo 
##                              0.24352288                              0.52921735 
##                  0.32mM_H2O2_20min_redo                  0.32mM_H2O2_30min_redo 
##                              0.95158817                              0.82296446 
##                0.32mM_H2O2_40min_rescan                  0.32mM_H2O2_50min_redo 
##                              0.64385692                              0.38455924 
##                  0.32mM_H2O2_60min_redo                  0.32mM_H2O2_80min_redo 
##                              0.27395173                              0.15458056 
##                 0.32mM_H2O2_100min_redo                 0.32mM_H2O2_120min_redo 
##                              0.16774095                              0.13646671 
##                 0.32mM_H2O2_160min_redo                1mM_Menadione_10min_redo 
##                              0.18783257                              0.19330670 
##                1mM_Menadione_20min_redo                1mM_Menadione_30min_redo 
##                              0.31401509                              0.33856436 
##                1mM_Menadione_40min_redo                1mM_Menadione_50min_redo 
##                              0.33818451                              0.24808122 
##                1mM_Menadione_80min_redo               1mM_Menadione_105min_redo 
##                              0.36923009                              0.29390984 
##               1mM_Menadione_120min_redo               1mM_Menadione_160min_redo 
##                              0.43195236                              0.27850858 
##                  2.5mM_DTT_005min_dtt-1                  2.5mM_DTT_015min_dtt-1 
##                              0.39654095                              0.27117130 
##                  2.5mM_DTT_030min_dtt-1                  2.5mM_DTT_045min_dtt-1 
##                              0.25578037                              0.29741841 
##                  2.5mM_DTT_060min_dtt-1                  2.5mM_DTT_090min_dtt-1 
##                              0.36201153                              0.37834877 
##                  2.5mM_DTT_120min_dtt-1                  2.5mM_DTT_180min_dtt-1 
##                              0.59946284                              0.66001517 
##                       dtt_000_min_dtt-2                       dtt_015_min_dtt-2 
##                              0.36246088                              0.31768601 
##                       dtt_030_min_dtt-2                       dtt_060_min_dtt-2 
##                              0.20834757                              0.18964943 
##                       dtt_120_min_dtt-2                       dtt_240_min_dtt-2 
##                              0.27733911                              0.51518324 
##                      1.5mM_diamide_5min                     1.5mM_diamide_10min 
##                              0.13484936                              0.20104724 
##                     1.5mM_diamide_20min                     1.5mM_diamide_30min 
##                              0.53513554                              0.81986742 
##                     1.5mM_diamide_40min                     1.5mM_diamide_50min 
##                              0.81360486                              0.68737603 
##                     1.5mM_diamide_60min                     1.5mM_diamide_90min 
##                              0.49479136                              0.45626746 
##                       1M_sorbitol_05min                       1M_sorbitol_15min 
##                              0.22092602                              0.10467921 
##                       1M_sorbitol_30min                       1M_sorbitol_45min 
##                              0.52708945                              0.28812671 
##                       1M_sorbitol_60min                       1M_sorbitol_90min 
##                              0.12880897                              0.05453072 
##                      1M_sorbitol_120min                      Hypo_osmotic_05min 
##                              0.07124918                              0.09408756 
##                      Hypo_osmotic_15min                      Hypo_osmotic_30min 
##                              0.15994152                              0.19981359 
##                      Hypo_osmotic_45min                      Hypo_osmotic_60min 
##                              0.27888201                              0.21068233 
##                steady_state_1M_sorbitol                           aa_starv_0.5h 
##                              0.15235905                              0.44528628 
##                             aa_starv_1h                             aa_starv_2h 
##                              1.08676210                              0.42173531 
##                             aa_starv_4h                             aa_starv_6h 
##                              0.25790671                              0.36535160 
##                Nitrogen_Depletion_30min                   Nitrogen_Depletion_1h 
##                              0.37007693                              0.86865142 
##                   Nitrogen_Depletion_2h                   Nitrogen_Depletion_4h 
##                              0.83167714                              0.43139278 
##                   Nitrogen_Depletion_8h                  Nitrogen_Depletion_12h 
##                              0.30260144                              1.13294769 
##                   Nitrogen_Depletion_1d                   Nitrogen_Depletion_2d 
##                              1.13950897                              1.48564752 
##                   Nitrogen_Depletion_3d                   Nitrogen_Depletion_5d 
##                              1.45227809                              1.67895218 
##             Diauxic_Shift_Timecourse_0h           Diauxic_Shift_Timecourse_9.5h 
##                              1.66781719                              0.06196574 
##          Diauxic_Shift_Timecourse_11.5h          Diauxic_Shift_Timecourse_13.5h 
##                              0.07718014                              0.09693789 
##          Diauxic_Shift_Timecourse_15.5h          Diauxic_Shift_Timecourse_18.5h 
##                              0.11677921                              0.14947639 
##          Diauxic_Shift_Timecourse_20.5h                            YPD_2h_ypd-2 
##                              0.66910229                              0.78025445 
##                            YPD_4h_ypd-2                            YPD_6h_ypd-2 
##                              0.20999644                              0.25458657 
##                            YPD_8h_ypd-2            YPD_10h_ypd-2\tYPD_12h_ypd-2 
##                              2.98596520                              2.03171997 
##                            YPD_1d_ypd-2                            YPD_2d_ypd-2 
##                              1.77040169                              1.78131648 
##                            YPD_3d_ypd-2                            YPD_5d_ypd-2 
##                              1.80172074                              1.90092954 
##           YPD_stationary_phase_2h_ypd-1           YPD_stationary_phase_4h_ypd-1 
##                              2.77374182                              3.66421762 
##           YPD_stationary_phase_8h_ypd-1          YPD_stationary_phase_12h_ypd-1 
##                              0.08823111                              0.13046111 
##           YPD_stationary_phase_1d_ypd-1           YPD_stationary_phase_2d_ypd-1 
##                              0.41311045                              2.14713165 
##           YPD_stationary_phase_3d_ypd-1           YPD_stationary_phase_5d_ypd-1 
##                              1.16361072                              1.35908241 
##           YPD_stationary_phase_7d_ypd-1          YPD_stationary_phase_13d_ypd-1 
##                              1.63347377                              1.98535985 
##          YPD_stationary_phase_22d_ypd-1          YPD_stationary_phase_28d_ypd-1 
##                              2.48011188                              2.20702171 
##                       DBY7286_37C_20min                     DBYmsn2-4_37C_20min 
##                              2.31377710                              2.26458321 
##         DBYmsn2/4_real_strain_37C_20min             DBYyap1-_37Cheat_20min_redo 
##                              0.46630794                              0.38657534 
##                      DBYyap1_37C_repeat                DBY7286_0.3mM_H2O2_20min 
##                              0.26413947                              0.49123854 
##     DBYmsn2msn4_good_strain_0.32mM_H2O2 DBYmsn2/4_real_strain_0.32mM_H2O2_20min 
##                              0.42965633                              0.67425340 
##               DBYyap1-_0.3mM_H2O2_20min               DBYyap1_0.32mM_H2O2_20min 
##                              0.58011760                              0.46806122 
##              Msn2_overexpression_repeat                     Msn4_overexpression 
##                              0.90376303                              0.73814894 
##                     YAP1_overexpression         ethanol_vs_reference_pool_car-1 
##                              0.09814771                              0.23323315 
##        galactose_vs_reference_poolcar-1         glucose_vs_reference_pool_car-1 
##                              0.35308661                              0.46603230 
##         mannose_vs_reference_pool_car-1       raffinose_vs_reference_pool_car-1 
##                              0.34135504                              0.55743405 
##         sucrose_vs_reference_pool_car-1      YP_ethanol_vs_reference_pool_car-2 
##                              0.33785304                              0.25483255 
##     YP_fructose_vs_reference_pool_car-2    YP_galactose_vs_reference_pool_car-2 
##                              0.42886206                              0.42919117 
##      YP_glucose_vs_reference_pool_car-2      YP_mannose_vs_reference_pool_car-2 
##                              0.24378256                              0.25566584 
##    YP_raffinose_vs_reference_pool_car-2      YP_sucrose_vs_reference_pool_car-2 
##                              0.43660972                              0.33361483 
##                         17C_growth_ct-1                         21C_growth_ct-1 
##                              0.31684865                              0.25733671 
##                         25C_growth_ct-1                         29C_growth_ct-1 
##                              0.23098183                              0.18425414 
##                         37C_growth_ct-1                   steady_state_15C_ct-2 
##                              0.14651604                              0.14009110 
##                   steady_state_17C_ct-2                   steady_state_21C_ct-2 
##                              0.21396154                              0.16963686 
##                   steady_state_25C_ct-2                   steady_state_29C_ct-2 
##                              0.24082752                              0.15342013 
##                   steady_state_33C_ct-2                   steady_state_36C_ct-2 
##                              0.13162457                              0.11358531 
##        steady_state_36C_ct-2_repeat_hyb              steady state 36 dec C ct-2 
##                              0.20331521                              0.16291409 
## steady state 36 dec C ct-2 (repeat hyb) 
##                              0.16075677

Part 4: analyze the regression coefficient

divide environmental condition into groups

With the models from group lasso, we can carefully select the regression coefficient from different groups of environmental conditions and compare them. By calculating the sum and L2 norm of their coefficient under each group of environmental condition, we are able to compare how the regulatory effect of each motif differ in each environmental stress.

names_heatshock_29to33_condition = c(
  "hs_29to33_05min",
  "hs_29to33_15min",
  "hs_29to33_30min"
)

names_heatshock_sorbitol_condition = c(
  "29C(1M_sorbitol)~33C(1M_sorbitol)_05min",
  "29C(1M_sorbitol)~33C(1M_sorbitol)_15min",
  "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"
)

names_hypothermia_condition = c(
  "hs_37to25_15min",
  "hs_37to25_30min",
  "hs_37to25_60min"
)

names_heatshock_25to37_condition = c(
  "hs_15min_hs-1",
  "hs_30min_hs-1",
  "hs_60min_hs-1"
)
summary_motif_frequency = data.frame(matrix(ncol= 2, nrow = 0))

for(motif in names_motifs_valid){
  summary_motif_frequency = rbind(summary_motif_frequency, data.frame(
    motif = motif,
    transcriptAmount = log10(sum(fullTable_Gasch[[motif]] > 0))
  ))
}
summary_coefficient = merge(x = summary_coefficient, y = summary_motif_frequency, by = "motif")

L2 norm of the non-zero regression coefficient from significant motif under all condition

The following graph shows the L2 norm of the regression coefficient of some significant motif under all environmental conditions. The rank of each motif shows how significant a motif is in the environmental conditions we considered. However, since the models we fitted here only considered some typical environmental stress on yeast and is very unbalanced in the type of condition (about 40 model is temperature relevant), we couldn’t say the high-rank motif shown here is significant under general environmental stress.

compare L2 nore and sum of the non-zero regression coefficient from significant motif under heat shock and hypothermia

the regulatory effect of each motif under hypothermia is much smaller than that under heat shock. This could probably reflect the trigger of environmental stress response (ESR) mention by Gasch et al. (2000): “the ESR is not initiated in response to all environmental changes and that the large, transient changes in expression that are characteristic of the ESR are only seen when this response is initiated and not in the reciprocal response to diminished environmental stress.”

compare L2 norm and sum of the non-zero regression coefficient from significant motif under heat shock, heat shock with sorbitol

Although most of the regression coefficients are similar in both heat shock with and without sorbitol, there are some motifs show a relatively different regulatory effect on gene expression with sorbitol. Check the second graph Sum of the non-zero regression coefficient of significant motifs under heat shock with and without sorbitol after 5, 15 and 30 min, it is easy to see TGAGGCTA shows a much stronger regulatory effect on gene expression when there is sorbitol while UKWCGRGGN and GTAGTATTT acting in an opposite way, a much weaker regulatory effect under heat shock with sorbitol.

investigate the transcript frequency of each motif against their coefficient

order_motifs = summary_coefficient[summary_coefficient$condition == "all",]
order_motifs = order_motifs[order(-order_motifs$transcriptAmount),]$motif

box graph for cross comparison between two type of environmental stress

To compare how the coefficient of motif from two different type of environmental stress, we plot a box graph show the difference between the coefficients from two group of models.

names_heatshock_29to33_condition = c(
  "hs_29to33_05min",
  "hs_29to33_15min",
  "hs_29to33_30min"
)

names_heatshock_sorbitol_condition = c(
  "29C(1M_sorbitol)~33C(1M_sorbitol)_05min",
  "29C(1M_sorbitol)~33C(1M_sorbitol)_15min",
  "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"
)

names_heatshock_25to37_condition = c(
  "hs_05min_hs-1",
  "hs_10min_hs-1",
  "hs_15min_hs-1",
  "hs_20min_hs-1",
  "hs_30min_hs-1",
  "hs_40min_hs-1",
  "hs_60min_hs-1",
  "hs_80min_hs-1"
)

names_heatshock_37to25_condition = c(
  "hs_37to25_15min",
  "hs_37to25_30min",
  "hs_37to25_45min",
  "hs_37to25_60min",
  "hs_37to25_90min"
)

comparison motifs rank between each group of models

As another approach from comparing the motifs’ regulatory effect between different type of environmental stresses, we compare the absolute value of their regression coefficients. Motifs are ranked in each model where rank 1 is given to the motifs with highest absolute value of coefficient. Then, for each group of environmental stress, we compute the average rank of each motif weighted by L2 of the coefficients in each model. Therefore, the rank from a model with motifs with low regression coefficients will get less weight rather than that of a model with motifs with high regression coefficients.

# prepare data
summary_coefficient_rank = data.frame(matrix(ncol = 4, nrow = 0))

for(environmental_stress in names_environmental_stress){
  coefficients = extract_coefficient(models_all_minLambda,environmental_stress)
  #coefficients = coefficients[order(abs(coefficients$value)),]
  coefficients = coefficients[order(-abs(coefficients$value)),]

  i = 1
  while (i <= dim(coefficients)[1]) {
    summary_coefficient_rank = rbind(summary_coefficient_rank, data.frame(
    motif = coefficients[i]$rn,
    rank = i,
    condition = environmental_stress,
    model.L2 = sqrt(sum(coefficients$value^2))
    ))
    i = i+1
  }
}
{
  names_heatshock_25to37_condition = vector(mode="character", length=8)
  names_heatshock_25to37_condition[1] = "hs_05min_hs-1"
  names_heatshock_25to37_condition[2] = "hs_10min_hs-1"
  names_heatshock_25to37_condition[3] = "hs_15min_hs-1"
  names_heatshock_25to37_condition[4] = "hs_20min_hs-1"
  names_heatshock_25to37_condition[5] = "hs_30min_hs-1"
  names_heatshock_25to37_condition[6] = "hs_40min_hs-1"
  names_heatshock_25to37_condition[7] = "hs_60min_hs-1"
  names_heatshock_25to37_condition[8] = "hs_80min_hs-1"
}

{
  names_heatshock_37to25_condition = vector(mode="character", length=5)
  names_heatshock_37to25_condition[1] = "hs_37to25_15min"
  names_heatshock_37to25_condition[2] = "hs_37to25_30min"
  names_heatshock_37to25_condition[3] = "hs_37to25_45min"
  names_heatshock_37to25_condition[4] = "hs_37to25_60min"
  names_heatshock_37to25_condition[5] = "hs_37to25_90min"
}

{
  names_heatshock_sorbitol_condition = vector(mode="character", length=3)
  names_heatshock_sorbitol_condition[1] = "29C(1M_sorbitol)~33C(1M_sorbitol)_05min"
  names_heatshock_sorbitol_condition[2] = "29C(1M_sorbitol)~33C(1M_sorbitol)_15min"
  names_heatshock_sorbitol_condition[3] = "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"
}

{
  names_sorbitol_condition = vector(mode="character", length=7)
  names_sorbitol_condition[1] = "1M_sorbitol_05min"
  names_sorbitol_condition[2] = "1M_sorbitol_15min"
  names_sorbitol_condition[3] = "1M_sorbitol_30min"
  names_sorbitol_condition[4] = "1M_sorbitol_45min"
  names_sorbitol_condition[5] = "1M_sorbitol_60min"
  names_sorbitol_condition[6] = "1M_sorbitol_90min"
  names_sorbitol_condition[7] = "1M_sorbitol_120min"
}

{
  names_Nitrogen_condition = vector(mode="character", length=7)
  names_Nitrogen_condition[1] = "Nitrogen_Depletion_30min"
  names_Nitrogen_condition[2] = "Nitrogen_Depletion_1h"
  names_Nitrogen_condition[3] = "Nitrogen_Depletion_2h"
  names_Nitrogen_condition[4] = "Nitrogen_Depletion_4h"
  names_Nitrogen_condition[5] = "Nitrogen_Depletion_8h"
  names_Nitrogen_condition[6] = "Nitrogen_Depletion_12h"
  names_Nitrogen_condition[7] = "Nitrogen_Depletion_1d"
}

{
  names_diamide_condition = vector(mode="character", length=8)
  names_diamide_condition[1] = "1.5mM_diamide_5min"
  names_diamide_condition[2] = "1.5mM_diamide_10min"
  names_diamide_condition[3] = "1.5mM_diamide_20min"
  names_diamide_condition[4] = "1.5mM_diamide_30min"
  names_diamide_condition[5] = "1.5mM_diamide_40min"
  names_diamide_condition[6] = "1.5mM_diamide_50min"
  names_diamide_condition[7] = "1.5mM_diamide_60min"
  names_diamide_condition[8] = "1.5mM_diamide_90min"
}




{
  names_dithiothrietol_condition = vector(mode="character", length=8)
  names_dithiothrietol_condition[1] = "2.5mM_DTT_005min_dtt-1"
  names_dithiothrietol_condition[2] = "2.5mM_DTT_015min_dtt-1"
  names_dithiothrietol_condition[3] = "2.5mM_DTT_030min_dtt-1"
  names_dithiothrietol_condition[4] = "2.5mM_DTT_045min_dtt-1"
  names_dithiothrietol_condition[5] = "2.5mM_DTT_060min_dtt-1"
  names_dithiothrietol_condition[6] = "2.5mM_DTT_090min_dtt-1"
  names_dithiothrietol_condition[7] = "2.5mM_DTT_120min_dtt-1"
  names_dithiothrietol_condition[8] = "2.5mM_DTT_180min_dtt-1"
}

{
  names_meanadione_condition = vector(mode="character", length=8)
  names_meanadione_condition[1] = "1mM_Menadione_10min_redo"
  names_meanadione_condition[2] = "1mM_Menadione_20min_redo"
  names_meanadione_condition[3] = "1mM_Menadione_30min_redo"
  names_meanadione_condition[4] = "1mM_Menadione_40min_redo"
  names_meanadione_condition[5] = "1mM_Menadione_50min_redo"
  names_meanadione_condition[6] = "1mM_Menadione_80min_redo"
  names_meanadione_condition[7] = "1mM_Menadione_120min_redo"
  names_meanadione_condition[8] = "1mM_Menadione_160min_redo"
}

{
  names_H2O2_condition = vector(mode="character", length=10)
  names_H2O2_condition[1] = "0.32mM_H2O2_10min_redo"
  names_H2O2_condition[2] = "0.32mM_H2O2_20min_redo"
  names_H2O2_condition[3] = "0.32mM_H2O2_30min_redo"
  names_H2O2_condition[4] = "0.32mM_H2O2_40min_rescan"
  names_H2O2_condition[5] = "0.32mM_H2O2_50min_redo"
  names_H2O2_condition[6] = "0.32mM_H2O2_60min_redo"
  names_H2O2_condition[7] = "0.32mM_H2O2_80min_redo"
  names_H2O2_condition[8] = "0.32mM_H2O2_100min_redo"
  names_H2O2_condition[9] = "0.32mM_H2O2_120min_redo"
  names_H2O2_condition[10] = "0.32mM_H2O2_160min_redo"
}
order the motif
plot graph

## Warning: Removed 4 rows containing missing values (geom_bar).

investigating bias and expression mean of each model

The bias of the model can be considered as the expected gene expression level change of a gene with no cis-regulatory element we considered in the model. In other words, the bias indicates the expression level change that cannot be explained by the model and could indicate the estimated proportion of the efforts on regulation these cis-regulatory elements make.

Select motif with weighted square L2

In order to focus only on the significant motifs, we calculate the variance predict by motif for each motif over all the conditions. In this case, we consider no only the regression coefficient but also the transcript frequence: $$ \ \ \[2ex] e_{mg}= \[\begin{cases} 1,\quad \text{at least one motif } m \text{ is on gene }g\\ 0, \quad otherwise \end{cases}\]

\[6ex]

\[\begin{align} \text{Variance pridict by motif} &= \sum_c\sum_g \beta _{mc}^2 \times e_{mg}\\ &= \beta^2_{mc} \times transcript\ frequency \\ &= L_2^2\times transcript\ frequency \end{align}\]

$$

group conditions
{
  names_heatshock_29to33_condition = vector(mode="character", length=3)
  names_heatshock_29to33_condition[1] = "hs_29to33_05min"
  names_heatshock_29to33_condition[2] = "hs_29to33_15min"
  names_heatshock_29to33_condition[3] = "hs_29to33_30min"
}


{
  names_heatshock_sorbitol_condition = vector(mode="character", length=3)
  names_heatshock_sorbitol_condition[1] = "29C(1M_sorbitol)~33C(1M_sorbitol)_05min"
  names_heatshock_sorbitol_condition[2] = "29C(1M_sorbitol)~33C(1M_sorbitol)_15min"
  names_heatshock_sorbitol_condition[3] = "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"
}


{
  names_heatshock_25to37_condition = vector(mode="character", length=8)
  names_heatshock_25to37_condition[1] = "hs_05min_hs-1"
  names_heatshock_25to37_condition[2] = "hs_10min_hs-1"
  names_heatshock_25to37_condition[3] = "hs_15min_hs-1"
  names_heatshock_25to37_condition[4] = "hs_20min_hs-1"
  names_heatshock_25to37_condition[5] = "hs_30min_hs-1"
  names_heatshock_25to37_condition[6] = "hs_40min_hs-1"
  names_heatshock_25to37_condition[7] = "hs_60min_hs-1"
  names_heatshock_25to37_condition[8] = "hs_80min_hs-1"
}


{
  names_heatshock_37to25_condition = vector(mode="character", length=5)
  names_heatshock_37to25_condition[1] = "hs_37to25_15min"
  names_heatshock_37to25_condition[2] = "hs_37to25_30min"
  names_heatshock_37to25_condition[3] = "hs_37to25_45min"
  names_heatshock_37to25_condition[4] = "hs_37to25_60min"
  names_heatshock_37to25_condition[5] = "hs_37to25_90min"
}


{
  names_sorbitol_condition = vector(mode="character", length=7)
  names_sorbitol_condition[1] = "1M_sorbitol_05min"
  names_sorbitol_condition[2] = "1M_sorbitol_15min"
  names_sorbitol_condition[3] = "1M_sorbitol_30min"
  names_sorbitol_condition[4] = "1M_sorbitol_45min"
  names_sorbitol_condition[5] = "1M_sorbitol_60min"
  names_sorbitol_condition[6] = "1M_sorbitol_90min"
  names_sorbitol_condition[7] = "1M_sorbitol_120min"
}


{
  names_Nitrogen_condition = vector(mode="character", length=7)
  names_Nitrogen_condition[1] = "Nitrogen_Depletion_30min"
  names_Nitrogen_condition[2] = "Nitrogen_Depletion_1h"
  names_Nitrogen_condition[3] = "Nitrogen_Depletion_2h"
  names_Nitrogen_condition[4] = "Nitrogen_Depletion_4h"
  names_Nitrogen_condition[5] = "Nitrogen_Depletion_8h"
  names_Nitrogen_condition[6] = "Nitrogen_Depletion_12h"
  names_Nitrogen_condition[7] = "Nitrogen_Depletion_1d"
}

{
  names_diamide_condition = vector(mode="character", length=8)
  names_diamide_condition[1] = "1.5mM_diamide_5min"
  names_diamide_condition[2] = "1.5mM_diamide_10min"
  names_diamide_condition[3] = "1.5mM_diamide_20min"
  names_diamide_condition[4] = "1.5mM_diamide_30min"
  names_diamide_condition[5] = "1.5mM_diamide_40min"
  names_diamide_condition[6] = "1.5mM_diamide_50min"
  names_diamide_condition[7] = "1.5mM_diamide_60min"
  names_diamide_condition[8] = "1.5mM_diamide_90min"
}



{
  names_dithiothrietol_condition = vector(mode="character", length=7)
  names_dithiothrietol_condition[1] = "2.5mM_DTT_005min_dtt-1"
  names_dithiothrietol_condition[2] = "2.5mM_DTT_015min_dtt-1"
  names_dithiothrietol_condition[3] = "2.5mM_DTT_030min_dtt-1"
  names_dithiothrietol_condition[4] = "2.5mM_DTT_045min_dtt-1"
  names_dithiothrietol_condition[5] = "2.5mM_DTT_060min_dtt-1"
  names_dithiothrietol_condition[6] = "2.5mM_DTT_090min_dtt-1"
  names_dithiothrietol_condition[7] = "2.5mM_DTT_120min_dtt-1"
  names_dithiothrietol_condition[8] = "2.5mM_DTT_180min_dtt-1"
}



{
  names_meanadione_condition = vector(mode="character", length=9)
  names_meanadione_condition[1] = "1mM_Menadione_10min_redo"
  names_meanadione_condition[2] = "1mM_Menadione_20min_redo"
  names_meanadione_condition[3] = "1mM_Menadione_30min_redo"
  names_meanadione_condition[4] = "1mM_Menadione_40min_redo"
  names_meanadione_condition[5] = "1mM_Menadione_50min_redo"
  names_meanadione_condition[6] = "1mM_Menadione_80min_redo"
  names_meanadione_condition[7] = "1mM_Menadione_105min_redo"
  names_meanadione_condition[8] = "1mM_Menadione_120min_redo"
  names_meanadione_condition[9] = "1mM_Menadione_160min_redo"
}

{
  names_H2O2_condition = vector(mode="character", length=9)
  names_H2O2_condition[1] = "0.32mM_H2O2_10min_redo"
  names_H2O2_condition[2] = "0.32mM_H2O2_20min_redo"
  names_H2O2_condition[3] = "0.32mM_H2O2_30min_redo"
  names_H2O2_condition[4] = "0.32mM_H2O2_40min_rescan"
  names_H2O2_condition[5] = "0.32mM_H2O2_50min_redo"
  names_H2O2_condition[6] = "0.32mM_H2O2_60min_redo"
  names_H2O2_condition[7] = "0.32mM_H2O2_80min_redo"
  names_H2O2_condition[8] = "0.32mM_H2O2_120min_redo"
  names_H2O2_condition[9] = "0.32mM_H2O2_160min_redo"
}
manage plotting order
plotting the weighted square L2 of each motif

select motifs

We selected the 6 motifs with the highest weighted L2 value. They should be reliable significant motifs under the environmental conditions we considered.

names_motifs_significant = head(order_motifs,6)

Compare mean of L2 of significant motifs across different group of environmental stress

First, We align the models by only considering the model with time slices within 5min to 60mins. For each group of environmental stress, we calculate the mean of L2 to allow cross comparison across environmental stresses group with different number of models:

\[ mean\ L_2 = \sqrt{{\sum_c^N \beta _{mc}^2}\over \text{N}} \]

compare proportion of variance explain by each motif

However, the overall regulatory effect in a cell also depends on how strong the environmental stress is. For example, the change in expression level under heat shock from 25 degree to 37 degree should be much larger than that under heat shock from 29 degree to 33 degree. Therefore, we calculate the proportion of variance explain by motifs over the total variance to reduce such difference

We plot bar graphs to compare the value across each group of environmental stresses. The shape of the proportion variance graph is similar to that of the mean of L2. There is still a large difference between the value of heat shock from 29C to 33C and heat shock from 25C to 37C.

\[ \begin{align} \text{Variance pridict by motif} &= \sum_c({1\over G}\times \sum_g \beta _{mc}^2 \times e_{mg} - ({1\over G}\times\sum_g \beta _{mc} \times e_{mg})^2)\\ &= \sum_c (\beta^2_{mc}\ \times\ percentage\ of\ transcript\ frequence- (\beta_{mc} \times percentage\ of\ transcript\ frequence)^2) \\ \end{align} \]

\[ \text{proportion of variance explain by motif } = {\text{variance explain by motifs}\over \text{total variance}} \]

compute the total variance of expression level in each group

significant test with linear model

To evaluate the reliability of the regression coefficient of significant motifs, we fit a linear model with selected motifs and investigate their regression coefficient and standard error. According to previous figures, we apply the gene expression levels 15 or 20 mins after specific environmental stress when the regulatory effect of motifs is strong. We fitted 10 models in total and plot their coefficient and standard error. The error bars represent 1 standard error.

 names_motifs_significant = c(
  'UGUAHMNUA',
  'ATATTC',
  'TGTATAWT',
  'TGTAAATA',
  'WUUGUAWUWU',
  'ATACGGTGT'
)
heat shock from 29C to 33C

heat shock from 29C to 33C with sorbitol

heat shock from 25C to 37C

hypothermia from 37C to 25C

sorbitol

nitrogen

dianide

dithiothriethol

menadione

H2O2

coefficients

## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_errorbar).